ABSTRACT



A new language model for speech recognition inspired by linguistic analysis
is presented. The model develops hidden hierarchical structure incrementally
and uses it to extract meaningful information from the word history, thus
enabling the use of extended distance dependencies, in an attempt to
complement the locality of currently used n-gram Markov models. The model,
its probabilistic parametrization, a reestimation algorithm for the model
parameters and a set of experiments meant to evaluate its potential for speech
recognition are presented.

1 INTRODUCTION 

The task of a speech recognizer is to automatically transcribe speech into text. The
most successful approach to speech recognition so far is a statistical one [1]: given the
observed string of acoustic features $A$, find the most likely word string $\hat{W}$ among
those that could have generated $A$:

$$\hat{W} = \arg\max_W P(W|A) = \arg\max_W P(A|W) \cdot P(W) \qquad (1)$$

This paper is concerned with the estimation of the language model probability $P(W)$.
We will first describe current modeling approaches to the problem, followed by a detailed 
explanation of our model. A few preliminary experiments that show the potential of our 
approach for language modeling will then be presented. 



2 BASIC LANGUAGE MODELING 

The language modeling problem is to estimate the source probability $P(W)$ where
$W = w_1, w_2, \ldots, w_n$ is a sequence of words. This probability is estimated from a text
training corpus. Usually the model is parameterized: $P_\theta(W), \theta \in \Theta$, where $\Theta$ is
referred to as the parameter space. Due to the sequential nature of an efficient search
algorithm, the model operates left-to-right, allowing the computation

$$P(w_1, w_2, \ldots, w_n) = P(w_1) \cdot \prod_{i=2}^{n} P(w_i / w_1 \ldots w_{i-1}) \qquad (2)$$

† This work was funded by the NSF IRI-19618874 grant STIMULATE.



We thus seek to develop parametric conditional models:

$$P_\theta(w_i / w_1 \ldots w_{i-1}), \quad \theta \in \Theta, \; w_i \in V \qquad (3)$$

where $V$ is the vocabulary chosen by the modeler. Currently most successful is the n-gram
language model:

$$P_\theta(w_i / w_1 \ldots w_{i-1}) = P_\theta(w_i / w_{i-n+1} \ldots w_{i-1}) \qquad (4)$$

2.1 LANGUAGE MODEL QUALITY 

All attempts to derive an algorithm that would estimate the model parameters so
as to minimize the word error rate have failed. As an alternative, a statistical model is
evaluated by how well it predicts a string of symbols $W_t$, commonly named test data,
generated by the source to be modeled.

2.1.1 Perplexity 

Assume we compare two models $M_1$ and $M_2$; they assign probability $P_{M_1}(W_t)$ and
$P_{M_2}(W_t)$, respectively, to the sample test string $W_t$. "Naturally", we consider $M_1$ to be a
better model than $M_2$ if $P_{M_1}(W_t) > P_{M_2}(W_t)$. The test data is not seen during the model
estimation process.

A commonly used quality measure for a given model $M$ is related to the entropy of
the underlying source and was introduced under the name of perplexity (PPL) [?]:

$$PPL(M) = \exp\left(-\frac{1}{|W_t|} \sum_{i=1}^{N} \ln\left[P_M(w_i)\right]\right) \qquad (5)$$
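As a rough illustration of (5), the following Python sketch computes perplexity from a list of per-word probabilities assigned by some model to a test string. The function name `perplexity` and the toy probabilities are assumptions made for this example; they are not taken from the experiments reported later.

```python
import math

def perplexity(word_probs):
    """exp of the negative mean log-probability, as in (5)."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# Toy input: a model that gives every test word probability 1/4.
toy_ppl = perplexity([0.25] * 8)
```

A model that spreads its probability uniformly over four choices at every position has perplexity four, which is why perplexity is often read as an average branching factor.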

2.2 SMOOTHING 

Assume that our model $M$ is faced with the prediction $w_i | w_1 \ldots w_{i-1}$ and that $w_i$ has
not been seen in the training corpus in context $w_1 \ldots w_{i-1}$, which itself has possibly not
been encountered in the training corpus. If $P_M(w_i | w_1 \ldots w_{i-1}) = 0$ then $P_M(w_1 \ldots w_n) = 0$,
thus forcing a recognition error; good models are smooth, in the sense that
$\exists \epsilon(M) > 0$ s.t. $P_M(w_i | w_1 \ldots w_{i-1}) > \epsilon, \forall w_i \in V, (w_1 \ldots w_{i-1}) \in V^{i-1}$.

One standard approach that ensures smoothing is the deleted interpolation method [?].
It interpolates linearly among contexts of different order:

$$P_\theta(w_i | w_{i-n+1} \ldots w_{i-1}) = \sum_{k=0}^{n} \lambda_k \cdot f(w_i / h_k) \qquad (6)$$

where: $h_k = w_{i-k+1} \ldots w_{i-1}$ is the context of order $k$ when predicting $w_i$; $f(w_i / h_k)$ is the
relative frequency estimate for the conditional probability $P(w_i / h_k)$; $\lambda_k, k = 0 \ldots n$ are
the interpolation coefficients satisfying $\lambda_k > 0, k = 0 \ldots n$ and $\sum_{k=0}^{n} \lambda_k = 1$.

The model parameters $\theta$ then are: the counts $C(h_n, w_i)$ (lower order counts are
inferred recursively by $C(h_k, w_i) = \sum_{w_{i-k} \in V} C(w_{i-k}, h_k, w_i)$) and the interpolation
coefficients $\lambda_k, k = 0 \ldots n$.

A simple way to estimate the model parameters involves a two stage process: 

1. gather counts from development data (about 90% of the training data);

2. estimate interpolation coefficients to minimize the perplexity of check data (the
remaining 10% of the training data).

Other smoothing techniques are also used, e.g., maximum entropy or back-off.
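The interpolation in (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the order-0 term is taken to be a uniform distribution over the vocabulary, the weights are hand-picked rather than estimated on check data, and the names `train_counts` and `interpolated_prob` are hypothetical.

```python
from collections import Counter

def train_counts(tokens, n):
    """Count every k-gram, k = 1..n, in a list of tokens."""
    counts = Counter()
    for k in range(1, n + 1):
        for i in range(len(tokens) - k + 1):
            counts[tuple(tokens[i:i + k])] += 1
    return counts

def interpolated_prob(word, history, counts, vocab, lambdas):
    """Interpolated estimate of P(word | history).

    lambdas[0] weights a uniform distribution over vocab (an assumption);
    lambdas[k], k >= 1, weights the relative frequency that conditions on
    the last k-1 words of history. history needs at least len(lambdas)-2 words.
    """
    prob = lambdas[0] / len(vocab)
    for k in range(1, len(lambdas)):
        context = tuple(history[-(k - 1):]) if k > 1 else ()
        # Denominator: total count of the context followed by any word.
        denom = sum(counts[context + (w,)] for w in vocab)
        if denom > 0:
            prob += lambdas[k] * counts[context + (word,)] / denom
    return prob

tokens = "a b a b a c".split()
vocab = sorted(set(tokens))
counts = train_counts(tokens, n=2)
lambdas = [0.1, 0.3, 0.6]          # hand-picked, not estimated on check data
p_b_given_a = interpolated_prob("b", ["a"], counts, vocab, lambdas)
```

Because every weight is positive and the uniform term is never zero, no word receives probability zero, which is the smoothness property required above. For a context that never occurred in the counts, this sketch simply drops that term, so the result is only normalized for seen contexts.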

3 DESCRIPTION OF THE STRUCTURED LANGUAGE MODEL 

The model we present is closely related to the one investigated in [?], but differs
in a few important aspects:

• our model operates in a left-to-right manner, thus allowing its use directly in the
hypothesis search for $\hat{W}$ in (1);

• our model is a factored version of the one in [?], thus enabling the calculation of the
joint probability of words and parse structure; this was not possible in the previous
case due to the huge computational complexity of the model.

3.1 THE BASIC IDEA AND TERMINOLOGY 

Consider predicting the word after in the sentence: 
the contract ended with a loss of 7 cents after trading as low as 89 cents 
A 3-gram approach would predict after from (7, cents) whereas it is intuitively clear 
that the strongest word-pair predictor would be contract ended which is outside the 
reach of even 7-grams. Our assumption is that what enables humans to make a good 
prediction of after is the syntactic structure of its sentence prefix. The linguistically 
correct partial parse of this prefix is shown in Figure 1.

A binary branching parse for a string of words is a binary tree whose leaves are the 
words. The headword annotation makes the tree an oriented graph: at each node we have 
two children; the current node receives a headword from either child; one arrow suffices 
to describe which of the children — left or right — is percolated to become the headword 
of the parent. It was found that better parse trees are generated when using tags:
part-of-speech (POS) tags for the leaves and non-terminal (NT) tags for the intermediate nodes
in the parse tree. Any subtree identifies a constituent. The word ended is called the
headword of the constituent (ended (with (...))) and ended is an exposed headword
when predicting after: the topmost headword in the largest constituent that contains it.
The syntactic structure in the past filters out irrelevant words and points to the important 
ones, thus enabling the use of long distance information when predicting the next word. 




the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS after

Figure 1: Partial parse



Our model will attempt to build the syntactic structure incrementally while traversing
the sentence left-to-right; it will assign a probability $P(W, T)$ to every sentence $W$
with every possible POStag assignment, binary branching parse, non-terminal tag and
headword annotation for every constituent of the parse tree $T$.

Let $W$ be a sentence of length $n$ words to which we have prepended <s> and appended
</s> so that $w_0 =$ <s> and $w_{n+1} =$ </s>. Let $W_k$ be the word k-prefix $w_0 \ldots w_k$ of the
sentence and $W_k T_k$ the word-parse k-prefix. A word-parse k-prefix contains, for a given
parse, only those binary subtrees whose span is completely included in the word
k-prefix, excluding $w_0 =$ <s>. Single words along with their POStag can be regarded as
root-only subtrees. Figure 2 shows a word-parse k-prefix; h_0 .. h_{-m} are the exposed
heads, each head being a pair (headword, non-terminal tag), or (word, POStag) in the
case of a root-only tree.



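[Figure: exposed heads h_{-m} = (<s>, SB), ..., h_0 = (h_0.word, h_0.tag) above the tagged
words (<s>, SB) ... (w_p, t_p) (w_{p+1}, t_{p+1}) ... (w_k, t_k), followed by the words
w_{k+1} ... </s> that are not yet predicted]

Figure 2: A word-parse k-prefix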

A complete parse (Figure 3) is a binary parse of the
(<s>, SB) $(w_1, t_1) \ldots (w_n, t_n)$ (</s>, SE) sequence with the following two restrictions:

1. $(w_1, t_1) \ldots (w_n, t_n)$ (</s>, SE) is a constituent, headed by (</s>, TOP');

2. (</s>, TOP) is the only allowed head. Note that $((w_1, t_1) \ldots (w_n, t_n))$ needn't be
a constituent, but for the parses where it is, there is no restriction on which of
its words is the headword or what is the non-terminal tag that accompanies the
headword.

Our model can generate all and only the complete parses for a string
(<s>, SB) $(w_1, t_1) \ldots (w_n, t_n)$ (</s>, SE).




[Figure: complete parse spanning (<s>, SB) (w_1, t_1) ... (w_n, t_n) (</s>, SE)]

Figure 3: Complete parse

The model will operate by means of three modules:

• WORD-PREDICTOR predicts the next word $w_{k+1}$ given the word-parse k-prefix
$W_k T_k$ and then passes control to the TAGGER;

• TAGGER predicts the POStag $t_{k+1}$ of the next word given the word-parse k-prefix
and the newly predicted word $w_{k+1}$ and then passes control to the PARSER;

• PARSER grows the already existing binary branching structure by repeatedly
generating the transitions (adjoin-left, NTtag) or (adjoin-right, NTtag) until
it passes control to the WORD-PREDICTOR by taking a null transition. NTtag is the
non-terminal tag assigned to each newly built constituent and {left, right}
specifies from where the new headword is inherited. The parser always operates on the
two rightmost exposed heads, starting with the newly tagged word $w_{k+1}$.

The operations performed by the PARSER are illustrated in Figures 4-6 and they
ensure that all possible binary branching parses with all possible headword and
non-terminal tag assignments for the $w_1 \ldots w_k$ word sequence can be generated. It is easy
to see that any given word sequence with a possible parse and headword annotation is
generated by a unique sequence of model actions.

[Figure: exposed heads h_{-2}, h_{-1}, h_0; leftmost subtree T_{-m}]

Figure 4: Before an adjoin operation

[Figure: h'_{-1} = h_{-2}, h'_0 = (h_{-1}.word, NTtag); T'_{-m+1} <- <s>]

Figure 5: Result of adjoin-left under NTtag

[Figure: h'_{-1} = h_{-2}, h'_0 = (h_0.word, NTtag); T'_{-m+1} <- <s>]

Figure 6: Result of adjoin-right under NTtag
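The effect of the two adjoin transitions on the exposed heads can be sketched in a few lines of Python. The list-of-pairs representation, the function name `adjoin` and the tag "NP" in the usage line are assumptions chosen for illustration; only the head percolation rule itself comes from the description above.

```python
def adjoin(heads, direction, nt_tag):
    """Merge the two rightmost exposed heads into one constituent.

    heads is a list of (headword, tag) pairs, rightmost last (h_0).
    'left' percolates the headword of h_{-1}, 'right' that of h_0;
    the new constituent is labelled with nt_tag.
    """
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    if len(heads) < 2:
        raise ValueError("need two exposed heads to adjoin")
    left, right = heads[-2], heads[-1]
    headword = left[0] if direction == "left" else right[0]
    return heads[:-2] + [(headword, nt_tag)]

# After "the contract" has been predicted and tagged:
heads = [("<s>", "SB"), ("the", "DT"), ("contract", "NN")]
heads = adjoin(heads, "right", "NP")   # "NP" is an illustrative tag
```

After the call, the exposed heads are (<s>, SB) and (contract, NP): the two words have become one constituent headed by contract, and the former h_{-2} has become h_{-1}.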

3.2 PROBABILISTIC MODEL 

The probability $P(W, T)$ of a word sequence $W$ and a complete parse $T$ can be broken
into:

$$P(W, T) = \prod_{k=1}^{n+1} \Big[ P(w_k / W_{k-1}T_{k-1}) \cdot P(t_k / W_{k-1}T_{k-1}, w_k) \cdot \prod_{i=1}^{N_k} P(p_i^k / W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) \Big] \qquad (7)$$

where:

• $W_{k-1}T_{k-1}$ is the word-parse (k-1)-prefix;

• $w_k$ is the word predicted by WORD-PREDICTOR;

• $t_k$ is the tag assigned to $w_k$ by the TAGGER;

• $N_k - 1$ is the number of operations the PARSER executes at position $k$ of the input
string before passing control to the WORD-PREDICTOR (the $N_k$-th operation at
position $k$ is the null transition); $N_k$ is a function of $T$;

• $p_i^k$ denotes the i-th PARSER operation carried out at position $k$ in the word string:
$p_i^k \in$ {(adjoin-left, NTtag), (adjoin-right, NTtag)} for $1 \le i < N_k$, and
$p_i^k =$ null for $i = N_k$.

Each $(W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_i^k)$ is a valid word-parse k-prefix $W_k T_k$ at position $k$
in the sentence, $i = 1, \ldots, N_k$.
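Since the joint probability is a product over the individual WORD-PREDICTOR, TAGGER and PARSER moves, its logarithm is a sum over the moves of a derivation. The Python sketch below shows that bookkeeping only. The `prob` callable, the triple layout (model, outcome, context) and the toy probability table are assumptions for illustration, not the model's actual parameterization.

```python
import math

def derivation_log_prob(moves, prob):
    """Sum of log-probabilities of the moves in one derivation.

    moves: iterable of (model, outcome, context) triples, where model is
           'WORD-PREDICTOR', 'TAGGER' or 'PARSER'.
    prob:  callable returning P(outcome / context) for the given model.
    """
    return sum(math.log(prob(model, outcome, context))
               for model, outcome, context in moves)

# Toy table with made-up probabilities for a three-move derivation.
toy_table = {
    ("WORD-PREDICTOR", "contract", ("the", "<s>")): 0.5,
    ("TAGGER", "NN", ("contract", "DT", "SB")): 0.5,
    ("PARSER", "null", ("contract", "the")): 0.25,
}
moves = list(toy_table)            # one move per table entry
log_p = derivation_log_prob(moves, lambda m, y, x: toy_table[(m, y, x)])
```

Working in the log domain avoids numerical underflow for long sentences and gives directly the kind of score by which hypotheses can be ranked during search.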

To ensure a proper probabilistic model certain PARSER and WORD-PREDICTOR
probabilities must be given specific values:

• $P(\text{null} / W_k T_k) = 1$, if h_{-1}.word = <s> and h_0 ≠ (</s>, TOP'), that is,
before predicting </s>; this ensures that (<s>, SB) is adjoined in the last step of the
parsing process;

• $P(\text{(adjoin-right, TOP)} / W_k T_k) = 1$, if h_0 = (</s>, TOP') and h_{-1}.word = <s>, and
$P(\text{(adjoin-right, TOP')} / W_k T_k) = 1$, if h_0 = (</s>, TOP') and h_{-1}.word ≠ <s>; these
ensure that the parse generated by our model is consistent with the definition of a
complete parse;

• $\exists \epsilon > 0, \forall W_{k-1}T_{k-1}$: $P(w_k =$ </s> $/ W_{k-1}T_{k-1}) > \epsilon$ ensures that the model halts with
probability one.

In order to be able to estimate the model components we need to make appropriate 
equivalence classifications of the conditioning part for each component, respectively. 

The equivalence classification should identify the strong predictors in the context and
allow reliable estimates from a treebank. Our choice is inspired by [?]:

$$P(w_k / W_{k-1}T_{k-1}) = P(w_k / [W_{k-1}T_{k-1}]) = P(w_k / h_0, h_{-1}) \qquad (8)$$
$$P(t_k / w_k, W_{k-1}T_{k-1}) = P(t_k / w_k, [W_{k-1}T_{k-1}]) = P(t_k / w_k, h_0.\text{tag}, h_{-1}.\text{tag}) \qquad (9)$$
$$P(p_i^k / W_k T_k) = P(p_i^k / [W_k T_k]) = P(p_i^k / h_0, h_{-1}) \qquad (10)$$

It is worth noting that if the binary branching structure developed by the parser were 
always right-branching and we mapped the POStag and non-terminal tag vocabularies to 
a single type then our model would be equivalent to a trigram language model. 

3.3 SMOOTHING 

All model components (WORD-PREDICTOR, TAGGER, PARSER) are conditional
probabilistic models of the type $P(y / x_1, x_2, \ldots, x_n)$ where $y, x_1, x_2, \ldots, x_n$ belong
to a mixed bag of words, POStags, non-terminal tags and parser operations ($y$ only).
For simplicity, the smoothing method we chose was deleted interpolation among relative
frequency estimates of different orders $f_n(\cdot)$ using a recursive mixing scheme:

$$P(y / x_1, \ldots, x_n) = \lambda(x_1, \ldots, x_n) \cdot P(y / x_1, \ldots, x_{n-1}) + (1 - \lambda(x_1, \ldots, x_n)) \cdot f_n(y / x_1, \ldots, x_n) \qquad (11)$$
$$f_{-1}(y) = \text{uniform}(\text{vocabulary}(y)) \qquad (12)$$

The $\lambda$ coefficients are tied based on the range into which the count $C(x_1, \ldots, x_n)$ falls.
The approach is a standard one [?].
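The recursion in (11)-(12) can be written as a short recursive function. The sketch below is illustrative only: `rel_freq` and `lam` are hypothetical callables standing in for the relative frequency estimates and the tied interpolation weights, and the convention of passing `None` for the order -1 (uniform) level is an assumption of this example.

```python
def smoothed_prob(y, context, rel_freq, lam, vocab_size):
    """Recursive interpolation of (11), bottoming out in the uniform of (12).

    context:  tuple (x_1, ..., x_n), or None for the order -1 level
    rel_freq: rel_freq(y, context) -> f_n(y / x_1..x_n), 0.0 if unseen
    lam:      lam(context) -> weight given to the lower-order estimate
    """
    if context is None:          # order -1: uniform over the vocabulary of y
        return 1.0 / vocab_size
    # Drop the last conditioning item x_n to obtain the lower-order context.
    lower_context = context[:-1] if len(context) > 0 else None
    lower = smoothed_prob(y, lower_context, rel_freq, lam, vocab_size)
    weight = lam(context)
    return weight * lower + (1.0 - weight) * rel_freq(y, context)

# Toy relative frequencies and a constant weight, both made up.
freqs = {("a", ()): 0.4, ("a", ("x",)): 0.8}
p = smoothed_prob("a", ("x",),
                  rel_freq=lambda y, c: freqs.get((y, c), 0.0),
                  lam=lambda c: 0.5,
                  vocab_size=4)
```

In a fuller version, `lam` would look up the count of the context and return the weight shared by all contexts whose count falls in the same range, which is the tying described above.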



3.4 PRUNING STRATEGY 

Since the number of parses for a given word prefix $W_k$ grows exponentially with $k$,
$|\{T_k\}| \sim O(2^k)$, the state space of our model is huge even for relatively short sentences.
We thus have to prune most parses without discarding the most likely ones for a given
sentence $W$. Our pruning strategy is a synchronous multi-stack search algorithm.

Each stack contains hypotheses (partial parses) that have been constructed by the
same number of predictor and the same number of parser operations. The hypotheses in
each stack are ranked according to the $\ln(P(W_k, T_k))$ score, highest on top. The width of
the search is controlled by two parameters: 

• the maximum stack depth — the maximum number of hypotheses the stack can 
contain at any given time; 

• log-probability threshold — the difference between the log-probability score of the 
top-most hypothesis and the bottom-most hypothesis at any given state of the stack 
cannot be larger than a given threshold. 
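The two width controls can be illustrated on a single stack. The following Python sketch is a simplified stand-in for the multi-stack search: it prunes one list of scored hypotheses and does not model how stacks are synchronized. The function name and the (log_prob, parse) pair layout are assumptions.

```python
def prune_stack(hypotheses, max_depth, log_prob_threshold):
    """Keep the best hypotheses of one stack.

    hypotheses: list of (log_prob, parse) pairs.
    Keeps at most max_depth entries, and only those whose score is within
    log_prob_threshold of the best one.
    """
    ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)
    if not ranked:
        return []
    best = ranked[0][0]
    kept = [h for h in ranked if best - h[0] <= log_prob_threshold]
    return kept[:max_depth]

# Toy stack with made-up scores.
stack = [(-1.0, "a"), (-2.5, "b"), (-9.0, "c"), (-1.5, "d")]
narrow = prune_stack(stack, max_depth=2, log_prob_threshold=5.0)
```

With these toy scores, `narrow` keeps hypotheses "a" and "d": "c" is dropped by the threshold and "b" by the depth limit.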



3.5 WORD LEVEL PERPLEXITY 

Attempting to calculate the conditional perplexity by assigning to a whole sentence
the probability:

$$P(W / T^*) = \prod_{k=0}^{n} P(w_{k+1} / W_k T_k^*), \qquad (13)$$

where $T^* = \arg\max_T P(W, T)$ (the search for $T^*$ being carried out according to our pruning
strategy) is not valid because it is not causal: when predicting $w_{k+1}$ we would be using
$T^*$, which was determined by looking at the entire sentence. To be able to compare the
perplexity of our model with that resulting from the standard trigram approach, we need
to factor in the entropy of guessing the prefix of the final best parse $T_k^*$ before predicting
$w_{k+1}$, based solely on the word prefix $W_k$.

To maintain a left-to-right operation of the language model, the probability assignment
for the word at position $k + 1$ in the input sentence was made using:

$$P(w_{k+1} / W_k) = \sum_{T_k \in S_k} P(w_{k+1} / W_k T_k) \cdot \rho(W_k, T_k), \qquad (14)$$
$$\rho(W_k, T_k) = P(W_k T_k) / \sum_{T_k \in S_k} P(W_k T_k) \qquad (15)$$

where $S_k$ is the set of all parses present in our stacks at the current stage $k$.
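In code, (14) and (15) amount to a weighted average of per-parse predictions, with weights proportional to the joint probability of each surviving parse. The Python sketch below assumes the stack contents are available as (log-probability, parse) pairs and that `word_predictor` is some callable returning the conditional word probability; both are assumptions for illustration.

```python
import math

def next_word_prob(word, stack_entries, word_predictor):
    """P(w_{k+1} / W_k) as a mixture over the parses kept in the stacks.

    stack_entries: list of (log P(W_k T_k), parse) pairs, i.e. the set S_k.
    word_predictor(word, parse) -> P(w_{k+1} / W_k T_k).
    """
    max_log = max(lp for lp, _ in stack_entries)
    # Subtract the maximum before exponentiating for numerical stability;
    # the common factor cancels in the normalization of (15).
    weights = [math.exp(lp - max_log) for lp, _ in stack_entries]
    norm = sum(weights)
    return sum((w / norm) * word_predictor(word, parse)
               for w, (_, parse) in zip(weights, stack_entries))

# Toy example: two surviving parses with made-up probabilities.
entries = [(math.log(0.03), "A"), (math.log(0.01), "B")]
toy_predictor = lambda word, parse: {"A": 0.2, "B": 0.6}[parse]
p_next = next_word_prob("after", entries, toy_predictor)
```

Here the parse weights are 0.75 and 0.25, so the mixture is 0.75 · 0.2 + 0.25 · 0.6 = 0.3. The prediction uses only parses of the prefix, so the assignment remains causal.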

Note that if we set $\rho(W_k, T_k) = \delta(T_k, T_k^* | W_k)$, a 0-entropy guess for the prefix of the
parse $T_k$ to equal that of the final best parse, the two probability assignments (13)
and (14) would be the same, yielding a lower bound on the perplexity achievable by our
model when using a given pruning strategy.

A second important observation is that the next-word predictor probability
$P(w_{k+1} / W_k T_k)$ in (14) need not be the same as the WORD-PREDICTOR probability (8)
used to extract the structure $T_k$, thus leaving open the possibility to estimate it separately.



3.6 PARAMETER REESTIMATION 

3.6.1 First Model Reestimation

Our parameter re-estimation is inspired by the usual EM approach. Let
$(W, T^{(k)}), k = 1, 2, \ldots, N$ denote the set of parses of $W$ that survived our pruning
strategy. Each parse was produced by a unique sequence of model actions: predictor, tagger,
and parser moves. The collection of these moves will be called a derivation. Each of the $N$
members of the set is produced by exactly the same number of moves of each type. Each
move is uniquely specified by identifiers $(y^{(m)}, x^{(m)})$, where $m \in$ {WORD-PREDICTOR,
TAGGER, PARSER} denotes the particular model, $y^{(m)}$ is the specification of the
particular move taken (e.g., for $m =$ PARSER, the quantity $y^{(m)}$ specifies a choice from
{left, right, null} and the exact tag attached), and $x^{(m)}$ specifies the move's context
(e.g., for $m =$ PARSER, the two heads).

For each possible value $(y^{(m)}, x^{(m)})$ we will establish a counter which at the beginning of
any particular iteration will be empty. For each move $(y^{(m)}, x^{(m)})$ present in the derivation
of $(W, T^{(k)})$ we add to the counter specified by $(y^{(m)}, x^{(m)})$ the amount

$$\rho(W, T^{(k)}) = \frac{P(W, T^{(k)})}{\sum_{j=1}^{N} P(W, T^{(j)})}$$

where $P(W, T^{(j)})$ are evaluated on the basis of the model's parameter values established
at the end of the preceding iteration. We do that for all $(W, T^{(j)}), j = 1, 2, \ldots, N$ and for
all sentences $W$ in the training data. Let $C^{(m)}(y^{(m)}, x^{(m)})$ be the counter contents at the
end of this process. The corresponding relative frequency estimate will be

$$f(y^{(m)} / x^{(m)}) = \frac{C^{(m)}(y^{(m)}, x^{(m)})}{\sum_{z^{(m)}} C^{(m)}(z^{(m)}, x^{(m)})}$$

The lower order frequencies needed for the deleted interpolation of probabilities in the 
next iteration are derived in the obvious way from the same counters. 
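The counting step can be sketched as follows. This Python example assumes that the amount added for each move is the joint probability of its parse normalized over the N surviving parses of the sentence (the usual EM posterior weight). The function names, the dictionary of counters and the toy probabilities are hypothetical.

```python
from collections import defaultdict

def accumulate_counts(surviving_parses, counters):
    """Add posterior-weighted move counts for one sentence.

    surviving_parses: list of (joint_prob, moves); moves is a list of
                      (model, outcome, context) triples forming the derivation.
    counters: defaultdict(float) mapping (model, outcome, context) to a
              fractional count.
    """
    total = sum(p for p, _ in surviving_parses)
    for joint_prob, moves in surviving_parses:
        weight = joint_prob / total        # assumed normalized parse weight
        for move in moves:
            counters[move] += weight

def relative_frequency(counters, model, outcome, context):
    """Counter for (outcome, context) divided by the total for that context."""
    denom = sum(c for (m, _, x), c in counters.items()
                if m == model and x == context)
    return counters.get((model, outcome, context), 0.0) / denom if denom else 0.0

counters = defaultdict(float)
ctx = ("h0", "h-1")                        # placeholder for the two heads
accumulate_counts([(0.03, [("PARSER", "null", ctx)]),
                   (0.01, [("PARSER", "left", ctx)])], counters)
f_null = relative_frequency(counters, "PARSER", "null", ctx)
```

With the toy probabilities above, the two parses receive weights 0.75 and 0.25, so the relative frequency of the null move in that context is 0.75.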

It is worth noting that because of pruning (which is a function of the statistical
parameters in use), the sets of surviving parses $(W, T^{(k)}), k = 1, 2, \ldots, N$ for the same
sentence $W$ may be completely different for different iterations.



3.6.2 First Pass Initial Parameters 

Each model component (WORD-PREDICTOR, TAGGER, PARSER) is initialised
from a set of hand-parsed sentences, after each parse tree $(W, T)$ is decomposed into its
derivation$(W, T)$. Separately for each model component $m$, we:

• gather joint counts $C^{(m)}(y^{(m)}, x^{(m)})$ from the derivations that make up the
"development data" using $\rho(W, T) = 1$;

• estimate the deleted interpolation coefficients on joint counts gathered from "check
data" using the EM algorithm [?]. These are the initial parameters used with the
reestimation procedure described in the previous section.



3.6.3 Language Model Refinement

In order to improve performance, we develop a model to be used in (14), different from
the WORD-PREDICTOR model (8). We will call this new component the
L2R-WORD-PREDICTOR.

The key step is to recognize in (14) a hidden Markov model (HMM) with fixed transition
probabilities (although dependent on the position $k$ in the input sentence)
specified by the $\rho(W_k, T_k)$ values.

The Expectation-step of the EM algorithm [?] for gathering joint counts
$C^{(m)}(y^{(m)}, x^{(m)})$, $m =$ L2R-WORD-PREDICTOR-MODEL, is the standard one whereas
the Maximization-step uses the same count smoothing technique as that described in
section 3.6.1.

The second reestimation pass is seeded with the $m =$ WORD-PREDICTOR model
joint counts $C^{(m)}(y^{(m)}, x^{(m)})$ resulting from the first parameter reestimation pass (see
section 3.6.1).



4 EXPERIMENTS 

We have carried out the reestimation technique described in section 3.6 on 1 Mwds of
"development" data. For convenience we chose to work on the UPenn Treebank corpus,
a subset of the WSJ (Wall Street Journal) corpus. The vocabulary sizes were:
word vocabulary: 10k, open (all words outside the vocabulary are mapped to the <unk>
token); POS tag vocabulary: 40, closed; non-terminal tag vocabulary: 52, closed; parser
operation vocabulary: 107, closed. The development set size was 929,564 wds (sections
00-20), check set size 73,760 wds (sections 21-22), test set size 82,430 wds (sections 23-24).

Table 1 shows the results of the reestimation techniques presented in section 3.6; En
and L2Rn denote iteration n of the reestimation procedures described in sections 3.6.1 and
3.6.3, respectively. A deleted interpolation trigram model had perplexity 167.14 on the
same training-test data.



iteration     DEV set     TEST set
number        L2R-PPL     L2R-PPL
E0            24.70       167.47
E1            22.34       160.76
E2            21.69       158.97
E3 = L2R0     21.26       158.28
L2R5          17.44       153.76

Table 1: Parameter reestimation results



Simple linear interpolation between our model and the trigram model:

$$Q(w_{k+1} / W_k) = \lambda \cdot P(w_{k+1} / w_{k-1}, w_k) + (1 - \lambda) \cdot P(w_{k+1} / W_k)$$

yielded a further improvement in PPL, as shown in Table 2. The interpolation weight
was estimated on check data to be $\lambda = 0.36$. An overall relative reduction of 11% over
the trigram model has been achieved.
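The interpolation and the quoted relative reduction can be checked with a few lines of Python. The function names and the two toy word probabilities are assumptions for illustration; the perplexity values 167.14 (trigram) and 147.70 (interpolated L2R5) are the ones reported in this section.

```python
def interpolate(p_trigram, p_slm, lam=0.36):
    """Q = lam * P_trigram + (1 - lam) * P_slm for one predicted word."""
    return lam * p_trigram + (1.0 - lam) * p_slm

def relative_reduction(baseline_ppl, new_ppl):
    """Fraction by which new_ppl improves on baseline_ppl."""
    return (baseline_ppl - new_ppl) / baseline_ppl

q = interpolate(0.02, 0.05)                  # toy probabilities, not from the paper
gain = relative_reduction(167.14, 147.70)    # trigram vs. interpolated L2R5
```

The value of `gain` is about 0.116, consistent with the stated overall reduction of 11%.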

As outlined in section 3.5, the perplexity value calculated using (13) is a lower bound for
the achievable perplexity of our model; for the above search parameters and E3 model
statistics this bound was 99.60, corresponding to a relative reduction of 41% over the
trigram model.



iteration     TEST set     TEST set
number        L2R-PPL      3-gram interpolated PPL
E0            167.47       152.25
E3            158.28       148.90
L2R5          153.76       147.70

Table 2: Interpolation with trigram results



5 CONCLUSIONS AND FUTURE DIRECTIONS 

A new source model that organizes the prefix hierarchically in order to predict the next 
symbol is developed. As a case study we applied the source model to natural language, 
thus developing a new language model with applicability in speech recognition. 

We believe that the above experiments show the potential of our approach for improved 
language modeling for speech recognition. Our future plans include: 

• experiment with other parameterizations for the word predictor and parser models; 

• evaluate model performance as part of an automatic speech recognizer (measure 
word error rate improvement). 